df = pd.read_csv("/Users/amritdhillon/Desktop/GSB544/Week 9/cannabis_full.csv")
print("Shape:", df.shape)
print("\nCannabis Type value counts:")
print(df["Type"].value_counts())

cc = df.drop(columns=["Effects", "Flavor", "Strain"])
cc = cc.dropna()
print("\nCleaned shape:", cc.shape)
print("\nCleaned Type value counts:")
print(cc["Type"].value_counts())
Shape: (2351, 69)
Cannabis Type value counts:
Type
hybrid 1212
indica 699
sativa 440
Name: count, dtype: int64
Cleaned shape: (2305, 66)
Cleaned Type value counts:
Type
hybrid 1187
indica 687
sativa 431
Name: count, dtype: int64
Part One: Binary Classification
Data Cleaning
Code
# keep only Sativa and Indica strains
cc2 = cc[cc["Type"].isin(["sativa", "indica"])].copy()
print("Binary dataset shape:", cc2.shape)
print("\nType value counts (binary dataset):")
print(cc2["Type"].value_counts())

# predictors and target
X_bin = cc2.drop(columns=["Type"])
y_bin = cc2["Type"]
print("\nDtypes of predictors after conversion:")
print(X_bin.dtypes.value_counts())
Binary dataset shape: (1118, 66)
Type value counts (binary dataset):
Type
indica 687
sativa 431
Name: count, dtype: int64
Dtypes of predictors after conversion:
float64 65
Name: count, dtype: int64
I chose accuracy as my scoring metric since there is no single "correct" class to prioritize: neither Indica nor Sativa is more important to identify than the other, so overall accuracy made the most sense as a metric. The cross-validated predictions were fairly similar to the predictions made by the final fitted model.
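The same `cross_val_score` call accepts other built-in scorers through the `scoring` argument. A minimal sketch on synthetic data (not the cannabis set) comparing plain accuracy with balanced accuracy, which would matter more if one class heavily dominated:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# synthetic imbalanced binary problem (roughly 90% / 10%)
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

lda = LinearDiscriminantAnalysis()
acc = cross_val_score(lda, X, y, cv=5, scoring="accuracy").mean()
bal = cross_val_score(lda, X, y, cv=5, scoring="balanced_accuracy").mean()

# plain accuracy can be inflated by the majority class;
# balanced accuracy averages per-class recall instead
print(f"accuracy: {acc:.3f}  balanced accuracy: {bal:.3f}")
```

With the fairly balanced Indica/Sativa split here, plain accuracy is a reasonable choice; the sketch just shows how a different scorer would be swapped in.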
Q1: LDA
Code
lda = LinearDiscriminantAnalysis()

# 5-fold cross validation using accuracy as the scoring metric
lda_cv_scores = cross_val_score(lda, X_bin, y_bin, cv=5, scoring="accuracy")
print("\nLDA cross-validated accuracy scores:", lda_cv_scores)
print("Mean CV accuracy:", lda_cv_scores.mean())

# cross-validated predictions for the confusion matrix
y_pred_cv = cross_val_predict(lda, X_bin, y_bin, cv=5)
print("\nConfusion matrix (cross-validated predictions):")
print(confusion_matrix(y_bin, y_pred_cv, labels=["indica", "sativa"]))

cm = confusion_matrix(y_bin, y_pred_cv)
ldacv_cm = ConfusionMatrixDisplay(confusion_matrix=cm,
                                  display_labels=["indica", "sativa"])
ldacv_cm.plot()
plt.title('LDA Cross Validated Predictions - Confusion Matrix')
plt.show()

# final LDA model on the full dataset
lda_final = LinearDiscriminantAnalysis()
lda_final.fit(X_bin, y_bin)

# for comparison
y_pred_final = lda_final.predict(X_bin)
print("\nConfusion matrix (final model predictions):")
print(confusion_matrix(y_bin, y_pred_final, labels=["indica", "sativa"]))

cm2 = confusion_matrix(y_bin, y_pred_final)
ldaf_cm = ConfusionMatrixDisplay(confusion_matrix=cm2,
                                 display_labels=["indica", "sativa"])
ldaf_cm.plot()
plt.title('LDA Final Predictions - Confusion Matrix')
plt.show()

print("\nFinal model accuracy:", accuracy_score(y_bin, y_pred_final))
Now use the full dataset, including the Hybrid strains.
Q1: Fit a decision tree, plot the final fit, and interpret the results.
Code
cc_full = cc.copy()
X_full = cc_full.drop(columns=["Type"])
y_full = cc_full["Type"]

tree = DecisionTreeClassifier(random_state=123)
tree_cv_scores = cross_val_score(tree, X_full, y_full, cv=5, scoring="accuracy")
print("Decision Tree cross validated accuracy:", tree_cv_scores)
print("Mean Cross Validated accuracy:", tree_cv_scores.mean())

# fit tree on the full data
tree.fit(X_full, y_full)
plt.figure(figsize=(16, 10))
plot_tree(tree, feature_names=X_full.columns, class_names=tree.classes_,
          filled=True)
plt.title("Decision Tree")
plt.show()

# interpret results; the decision tree will definitely overfit here,
# but it is included to follow the same formatting as the prior problem
tree_pred_final = tree.predict(X_full)
print("\nFinal model accuracy (Decision Tree):", accuracy_score(y_full, tree_pred_final))

cm9 = confusion_matrix(y_full, tree_pred_final, labels=["indica", "sativa", "hybrid"])
disp = ConfusionMatrixDisplay(confusion_matrix=cm9, display_labels=["indica", "sativa", "hybrid"])
disp.plot()
plt.title("Decision Tree – Final Model Confusion Matrix")
plt.show()
Decision Tree cross validated accuracy: [0.47288503 0.50976139 0.51193059 0.50759219 0.4967462 ]
Mean Cross Validated accuracy: 0.4997830802603037
Final model accuracy (Decision Tree): 0.9765726681127983
The mean cross-validated accuracy of about 0.4997 for the Decision Tree makes sense: the three-class problem is harder, so accuracy drops significantly relative to the binary setting.
The dataset also consists largely of dummy variables, giving the tree many similar binary columns to split on, so roughly 50% cross-validated accuracy is not surprising.
The near-perfect final-model accuracy makes sense as well: an unrestricted decision tree tends to overfit data like this, effectively memorizing the patterns in the training set.
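One standard way to confirm the overfitting claim is to cap the tree depth and watch the train/CV gap shrink. A sketch on synthetic data, not the strain dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic 3-class problem standing in for the strain data
X, y = make_classification(n_samples=600, n_classes=3, n_informative=5,
                           n_clusters_per_class=1, random_state=123)

for depth in [None, 3]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=123)
    cv_acc = cross_val_score(tree, X, y, cv=5).mean()
    train_acc = tree.fit(X, y).score(X, y)
    # an unrestricted tree memorizes the training data (train accuracy 1.0);
    # capping depth narrows the gap between training and CV accuracy
    print(f"max_depth={depth}: train={train_acc:.3f}  cv={cv_acc:.3f}")
```

Tuning `max_depth` (or `min_samples_leaf`) via `GridSearchCV` would be the natural next step if the tree were meant as a serious competitor here.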
Q2: Repeat the analyses from Part One for LDA, QDA, and KNN.
LDA
Code
lda2 = LinearDiscriminantAnalysis()

# 5-fold cross validation using accuracy as the scoring metric
lda_cv_scores2 = cross_val_score(lda2, X_full, y_full, cv=5, scoring="accuracy")
print("\nLDA cross-validated accuracy scores:", lda_cv_scores2)
print("Mean CV accuracy:", lda_cv_scores2.mean())

# cross-validated predictions for the confusion matrix
y_pred_cv2 = cross_val_predict(lda2, X_full, y_full, cv=5)
print("\nConfusion matrix (cross-validated predictions):")
print(confusion_matrix(y_full, y_pred_cv2, labels=["indica", "sativa", "hybrid"]))

cm10 = confusion_matrix(y_full, y_pred_cv2)
ldacv_cm2 = ConfusionMatrixDisplay(confusion_matrix=cm10,
                                   display_labels=["indica", "sativa", "hybrid"])
ldacv_cm2.plot()
plt.title('LDA Cross Validated Predictions - Confusion Matrix')
plt.show()

# final LDA model on the full dataset
lda_final2 = LinearDiscriminantAnalysis()
lda_final2.fit(X_full, y_full)

# for comparison
y_pred_final2 = lda_final2.predict(X_full)
print("\nConfusion matrix (final model predictions):")
print(confusion_matrix(y_full, y_pred_final2, labels=["indica", "sativa", "hybrid"]))

cm11 = confusion_matrix(y_full, y_pred_final2)
ldaf_cm2 = ConfusionMatrixDisplay(confusion_matrix=cm11,
                                  display_labels=["indica", "sativa", "hybrid"])
ldaf_cm2.plot()
plt.title('LDA Final Predictions - Confusion Matrix')
plt.show()

print("\nFinal model accuracy:", accuracy_score(y_full, y_pred_final2))
Best parameters for KNN: {'n_neighbors': 11}
Best cross validated accuracy: 0.5908893709327548
Confusion matrix (cross validated predictions):
[[375 5 307]
[ 21 105 305]
[220 85 882]]
Final model accuracy (KNN): 0.6555314533622559
Confusion matrix (final KNN model):
[[425 3 259]
[ 23 137 271]
[168 70 949]]
Q3: Were your metrics better or worse than in Part One? Why? Which categories were most likely to get mixed up, according to the confusion matrices? Why?
Code
results_compare = pd.DataFrame({
    "Model": ["LDA", "QDA", "SVC (Linear)", "SVM (Poly)", "KNN"],
    # Part One (binary classification)
    "Part 1 CV Accuracy": [
        lda_cv_scores.mean(),
        qda_cv_scores.mean(),
        svc_grid.best_score_,
        svm_grid.best_score_,
        None,  # KNN not used in Part One
    ],
    # Part Two (multiclass)
    "Part 2 CV Accuracy": [
        lda_cv_scores2.mean(),
        qda_cv_scores2.mean(),
        None,  # SVC not used in Part Two
        None,  # SVM not used in Part Two
        knn_grid.best_score_,
    ],
    "Part 1 Final Accuracy": [
        accuracy_score(y_bin, y_pred_final),
        accuracy_score(y_bin, qda_y_pred_final),
        accuracy_score(y_bin, svc_y_pred_final),
        accuracy_score(y_bin, svm_y_pred_final),
        None,  # KNN not used in Part One
    ],
    "Part 2 Final Accuracy": [
        accuracy_score(y_full, y_pred_final2),
        accuracy_score(y_full, qda_y_pred_final2),
        None,  # SVC not used in Part Two
        None,  # SVM not used in Part Two
        accuracy_score(y_full, knn_y_pred_final),
    ],
})
results_compare
          Model  Part 1 CV Accuracy  Part 2 CV Accuracy  Part 1 Final Accuracy  Part 2 Final Accuracy
0           LDA            0.842581            0.629067               0.869410               0.642950
1           QDA            0.431086            0.221258               0.410555               0.208243
2  SVC (Linear)            0.852410                 NaN               0.865832                    NaN
3    SVM (Poly)            0.855093                 NaN               0.907871                    NaN
4           KNN                 NaN            0.590889                    NaN               0.655531
For every model comparable across both parts (LDA, QDA), accuracy was much worse in Part Two (multiclass); KNN, used only in Part Two, also scored worse than all of the Part One models except QDA.
This drop in accuracy is expected, since multiclass classification is harder for these models than binary classification.
Whereas the binary models only needed to separate two classes (Indica vs. Sativa), the multiclass models must distinguish Indica vs. Sativa vs. Hybrid, which means more decision boundaries and greater potential for overlap.
Across all confusion matrices in Part Two, the Hybrid class was consistently the hardest to classify correctly.
Hybrid strains share attributes with both Indica and Sativa, making them significantly harder to separate; overall accuracy drops because the models struggle to place strains with such unclear class attributes. Since Hybrids combine both the relaxing and the uplifting effects of the parent strains, their dummy-variable patterns resemble both Indica and Sativa.
Furthermore, the dataset contains far more Hybrid strains than either Indica or Sativa, which biases the models toward predicting Hybrid simply because of its dominance.
So while Indica and Sativa were fairly easy to separate in the binary setting thanks to their distinct effects, adding a third, intermediate class like Hybrid reduces overall classification accuracy.
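The effect of an intermediate class can be reproduced with synthetic blobs: two well-separated clusters plus a third sitting between them. A sketch, not the actual strain data:

```python
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# classes 0 and 1 are far apart; class 2 sits in between ("Hybrid"-like)
X, y = make_blobs(centers=[[-4, 0], [4, 0], [0, 0]],
                  cluster_std=2.0, n_samples=900, random_state=0)

lda = LinearDiscriminantAnalysis()
three_class = cross_val_score(lda, X, y, cv=5).mean()

# drop the middle class and accuracy recovers
mask = y != 2
two_class = cross_val_score(lda, X[mask], y[mask], cv=5).mean()

print(f"3-class CV accuracy: {three_class:.3f}")
print(f"2-class CV accuracy: {two_class:.3f}")
```

Removing the in-between cluster leaves a much cleaner boundary, mirroring the drop from Part One to Part Two here.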
Part Three: Multiclass from Binary
Q1: Fit and report metrics for OvR versions of the models. That is, for each of the two model types, create three models:
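The OvR code below uses binary indicator targets named `y_indica`, `y_sativa`, and `y_hybrid`, whose construction is not shown in this excerpt. A sketch of how they could be built (the names come from the later code; the construction itself is an assumption, shown on a toy Series rather than the real `y_full`):

```python
import pandas as pd

# toy stand-in for the real y_full Series of strain types
y_full = pd.Series(["indica", "sativa", "hybrid", "indica", "hybrid"])

# one binary indicator per class: 1 = target class, 0 = everything else
y_indica = (y_full == "indica").astype(int)
y_sativa = (y_full == "sativa").astype(int)
y_hybrid = (y_full == "hybrid").astype(int)

print(y_indica.tolist())  # [1, 0, 0, 1, 0]
```

Each of the six OvR models is then just a binary classifier fit against one of these indicators.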
svc3 = SVC(kernel="linear")
param_grid = {"C": [0.01, 0.1, 1, 5, 10]}
grid = GridSearchCV(svc3, param_grid, cv=5, scoring="accuracy")
grid.fit(X_full, y_indica)
print("Best C:", grid.best_params_)
print("Best CV Accuracy:", grid.best_score_)
svc_indica_cv = grid.best_score_

# cross-validated predictions with the best model
best_svc = grid.best_estimator_
y_pred_cv = cross_val_predict(best_svc, X_full, y_indica, cv=5)
print("Confusion Matrix (CV):")
cm_indica_cv = confusion_matrix(y_indica, y_pred_cv)
print(cm_indica_cv)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_indica_cv, display_labels=["Not Indica", "Indica"])
disp.plot()
plt.title("SVC OvR Confusion Matrix – Not Indica vs Indica (CV)")
plt.show()

# model on the full data
final_svc = best_svc
final_svc.fit(X_full, y_indica)
final_pred = final_svc.predict(X_full)
print("Final Accuracy (Indica vs Not Indica):", accuracy_score(y_indica, final_pred))
svc_indica_final = accuracy_score(y_indica, final_pred)
cm_indica_final = confusion_matrix(y_indica, final_pred)
print("Confusion Matrix (Final Model):")
print(cm_indica_final)
disp_final = ConfusionMatrixDisplay(confusion_matrix=cm_indica_final,
                                    display_labels=["Not Indica", "Indica"])
disp_final.plot()
plt.title("SVC OvR Confusion Matrix – Not Indica vs Indica (Final)")
plt.show()
Best C: {'C': 5}
Best CV Accuracy: 0.7887201735357918
Confusion Matrix (CV):
[[1361 257]
[ 230 457]]
Final Accuracy (Indica vs Not Indica): 0.7908893709327549
Confusion Matrix (Final Model):
[[1362 256]
[ 226 461]]
Sativa vs. Not Sativa
Code
grid = GridSearchCV(svc3, param_grid, cv=5, scoring="accuracy")
grid.fit(X_full, y_sativa)
print("Best C:", grid.best_params_)
print("Best CV Accuracy:", grid.best_score_)
svc_sativa_cv = grid.best_score_

# cross-validated predictions with the best model
best_svc = grid.best_estimator_
y_pred_cv = cross_val_predict(best_svc, X_full, y_sativa, cv=5)
print("Confusion Matrix (CV):")
cm_sativa_cv = confusion_matrix(y_sativa, y_pred_cv)
print(cm_sativa_cv)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_sativa_cv, display_labels=["Not Sativa", "Sativa"])
disp.plot()
plt.title("SVC OvR Confusion Matrix – Not Sativa vs Sativa (CV)")
plt.show()

# model on the full data
final_svc = best_svc
final_svc.fit(X_full, y_sativa)
final_pred = final_svc.predict(X_full)
print("Final Accuracy (Sativa vs Not Sativa):", accuracy_score(y_sativa, final_pred))
svc_sativa_final = accuracy_score(y_sativa, final_pred)
cm_sativa_final = confusion_matrix(y_sativa, final_pred)
print("Confusion Matrix (Final Model):")
print(cm_sativa_final)
disp_final = ConfusionMatrixDisplay(confusion_matrix=cm_sativa_final,
                                    display_labels=["Not Sativa", "Sativa"])
disp_final.plot()
plt.title("SVC OvR Confusion Matrix – Not Sativa vs Sativa (Final)")
plt.show()
Best C: {'C': 5}
Best CV Accuracy: 0.8190889370932755
Confusion Matrix (CV):
[[1817 57]
[ 360 71]]
Final Accuracy (Sativa vs Not Sativa): 0.8134490238611713
Confusion Matrix (Final Model):
[[1870 4]
[ 426 5]]
Hybrid vs. Not Hybrid
Code
grid.fit(X_full, y_hybrid)
print("Best C:", grid.best_params_)
print("Best CV Accuracy:", grid.best_score_)
svc_hybrid_cv = grid.best_score_

# cross-validated predictions with the best model
best_svc = grid.best_estimator_
y_pred_cv = cross_val_predict(best_svc, X_full, y_hybrid, cv=5)
print("Confusion Matrix (CV):")
cm_hybrid_cv = confusion_matrix(y_hybrid, y_pred_cv)
print(cm_hybrid_cv)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_hybrid_cv, display_labels=["Not Hybrid", "Hybrid"])
disp.plot()
plt.title("SVC OvR Confusion Matrix – Not Hybrid vs Hybrid (CV)")
plt.show()

# model on the full data
final_svc = best_svc
final_svc.fit(X_full, y_hybrid)
final_pred = final_svc.predict(X_full)
print("Final Accuracy (Hybrid vs Not Hybrid):", accuracy_score(y_hybrid, final_pred))
svc_hybrid_final = accuracy_score(y_hybrid, final_pred)
cm_hybrid_final = confusion_matrix(y_hybrid, final_pred)
print("Confusion Matrix (Final Model):")
print(cm_hybrid_final)
disp_final = ConfusionMatrixDisplay(confusion_matrix=cm_hybrid_final,
                                    display_labels=["Not Hybrid", "Hybrid"])
disp_final.plot()
plt.title("SVC OvR Confusion Matrix – Not Hybrid vs Hybrid (Final)")
plt.show()
Best C: {'C': 0.1}
Best CV Accuracy: 0.6247288503253796
Confusion Matrix (CV):
[[495 623]
[242 945]]
Final Accuracy (Hybrid vs Not Hybrid): 0.6281995661605206
Confusion Matrix (Final Model):
[[505 613]
[244 943]]
Logistic Regression
Indica vs. Not Indica
Code
log3 = LogisticRegression()
param_grid = {"C": [0.01, 0.1, 1, 5, 10]}
grid = GridSearchCV(log3, param_grid, cv=5, scoring="accuracy")
grid.fit(X_full, y_indica)
print("Best C:", grid.best_params_)
print("Best CV Accuracy:", grid.best_score_)
lr_indica_cv = grid.best_score_

# cross-validated predictions with the best model
best_log = grid.best_estimator_
y_pred_cv = cross_val_predict(best_log, X_full, y_indica, cv=5)
print("Confusion Matrix (CV):")
cm_indica_cv = confusion_matrix(y_indica, y_pred_cv)
print(cm_indica_cv)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_indica_cv, display_labels=["Not Indica", "Indica"])
disp.plot()
plt.title("Logistic Regression OvR Confusion Matrix – Not Indica vs Indica (CV)")
plt.show()

# model on the full data
final_log = best_log
final_log.fit(X_full, y_indica)
final_pred = final_log.predict(X_full)
print("Final Accuracy (Indica vs Not Indica):", accuracy_score(y_indica, final_pred))
lr_indica_final = accuracy_score(y_indica, final_pred)
cm_indica_final = confusion_matrix(y_indica, final_pred)
print("Confusion Matrix (Final Model):")
print(cm_indica_final)
disp_final = ConfusionMatrixDisplay(confusion_matrix=cm_indica_final,
                                    display_labels=["Not Indica", "Indica"])
disp_final.plot()
plt.title("Logistic Regression OvR Confusion Matrix – Not Indica vs Indica (Final)")
plt.show()
Best C: {'C': 0.1}
Best CV Accuracy: 0.7991323210412149
Confusion Matrix (CV):
[[1433 185]
[ 278 409]]
Final Accuracy (Indica vs Not Indica): 0.8073752711496747
Confusion Matrix (Final Model):
[[1441 177]
[ 267 420]]
Sativa vs. Not Sativa
Code
grid.fit(X_full, y_sativa)
print("Best C:", grid.best_params_)
print("Best CV Accuracy:", grid.best_score_)
lr_sativa_cv = grid.best_score_

# cross-validated predictions with the best model
best_log = grid.best_estimator_
y_pred_cv = cross_val_predict(best_log, X_full, y_sativa, cv=5)
print("Confusion Matrix (CV):")
cm_sativa_cv = confusion_matrix(y_sativa, y_pred_cv)
print(cm_sativa_cv)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_sativa_cv, display_labels=["Not Sativa", "Sativa"])
disp.plot()
plt.title("Logistic Regression OvR Confusion Matrix – Not Sativa vs Sativa (CV)")
plt.show()

# model on the full data
final_log = best_log
final_log.fit(X_full, y_sativa)
final_pred = final_log.predict(X_full)
print("Final Accuracy (Sativa vs Not Sativa):", accuracy_score(y_sativa, final_pred))
lr_sativa_final = accuracy_score(y_sativa, final_pred)
cm_sativa_final = confusion_matrix(y_sativa, final_pred)
print("Confusion Matrix (Final Model):")
print(cm_sativa_final)
disp_final = ConfusionMatrixDisplay(confusion_matrix=cm_sativa_final,
                                    display_labels=["Not Sativa", "Sativa"])
disp_final.plot()
plt.title("Logistic Regression OvR Confusion Matrix – Not Sativa vs Sativa (Final)")
plt.show()
Best C: {'C': 1}
Best CV Accuracy: 0.8281995661605206
Confusion Matrix (CV):
[[1777 97]
[ 299 132]]
Final Accuracy (Sativa vs Not Sativa): 0.8360086767895879
Confusion Matrix (Final Model):
[[1786 88]
[ 290 141]]
Hybrid vs. Not Hybrid
Code
grid.fit(X_full, y_hybrid)
print("Best C:", grid.best_params_)
print("Best CV Accuracy:", grid.best_score_)
lr_hybrid_cv = grid.best_score_

# cross-validated predictions with the best model
best_log = grid.best_estimator_
y_pred_cv = cross_val_predict(best_log, X_full, y_hybrid, cv=5)
print("Confusion Matrix (CV):")
cm_hybrid_cv = confusion_matrix(y_hybrid, y_pred_cv)
print(cm_hybrid_cv)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_hybrid_cv, display_labels=["Not Hybrid", "Hybrid"])
disp.plot()
plt.title("Logistic Regression OvR Confusion Matrix – Not Hybrid vs Hybrid (CV)")
plt.show()

# model on the full data
final_log = best_log
final_log.fit(X_full, y_hybrid)
final_pred = final_log.predict(X_full)
print("Final Accuracy (Hybrid vs Not Hybrid):", accuracy_score(y_hybrid, final_pred))
lr_hybrid_final = accuracy_score(y_hybrid, final_pred)
cm_hybrid_final = confusion_matrix(y_hybrid, final_pred)
print("Confusion Matrix (Final Model):")
print(cm_hybrid_final)
disp_final = ConfusionMatrixDisplay(confusion_matrix=cm_hybrid_final,
                                    display_labels=["Not Hybrid", "Hybrid"])
disp_final.plot()
plt.title("Logistic Regression OvR Confusion Matrix – Not Hybrid vs Hybrid (Final)")
plt.show()
Best C: {'C': 0.1}
Best CV Accuracy: 0.6251626898047722
Confusion Matrix (CV):
[[605 513]
[351 836]]
Final Accuracy (Hybrid vs Not Hybrid): 0.6455531453362255
Confusion Matrix (Final Model):
[[619 499]
[318 869]]
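scikit-learn can also automate this one-vs-rest loop: `OneVsRestClassifier` fits one binary model per class and predicts the class whose model scores highest. A minimal sketch on synthetic data (not the strain dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier

# synthetic 3-class stand-in for the strain data
X, y = make_classification(n_samples=600, n_classes=3, n_informative=5,
                           n_clusters_per_class=1, random_state=0)

# one LogisticRegression per class; prediction = class with the highest score
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
print("OvR CV accuracy:", cross_val_score(ovr, X, y, cv=5).mean())
print("underlying binary models:", len(ovr.fit(X, y).estimators_))
```

The manual loops above expose each binary model's confusion matrix individually, which the wrapper hides, so both approaches have their place.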
Q2: Which of the six models did the best job distinguishing the target category from the rest? Which did the worst? Does this make intuitive sense?
Code
ovr_results = pd.DataFrame({
    "Model": [
        "SVC – Indica vs Not Indica",
        "SVC – Sativa vs Not Sativa",
        "SVC – Hybrid vs Not Hybrid",
        "LogReg – Indica vs Not Indica",
        "LogReg – Sativa vs Not Sativa",
        "LogReg – Hybrid vs Not Hybrid",
    ],
    "Best CV Accuracy": [
        svc_indica_cv,
        svc_sativa_cv,
        svc_hybrid_cv,
        lr_indica_cv,
        lr_sativa_cv,
        lr_hybrid_cv,
    ],
    "Final Accuracy": [
        svc_indica_final,
        svc_sativa_final,
        svc_hybrid_final,
        lr_indica_final,
        lr_sativa_final,
        lr_hybrid_final,
    ],
})
ovr_results
                           Model  Best CV Accuracy  Final Accuracy
0     SVC – Indica vs Not Indica          0.788720        0.790889
1     SVC – Sativa vs Not Sativa          0.819089        0.813449
2     SVC – Hybrid vs Not Hybrid          0.624729        0.628200
3  LogReg – Indica vs Not Indica          0.799132        0.807375
4  LogReg – Sativa vs Not Sativa          0.828200        0.836009
5  LogReg – Hybrid vs Not Hybrid          0.625163        0.645553
The best performing of these six models was Logistic Regression (Sativa vs. Not Sativa), with a cross-validated accuracy of 0.8282; SVC (Sativa vs. Not Sativa) was nearly as good at 0.8191.
This makes sense, as Sativa has distinctive attributes, such as reported effects like "energetic" or "uplifted".
Additionally, Sativa is the smallest class, so a classifier that leans toward predicting "Not Sativa" already starts from a high accuracy, and Sativa's distinctive features push it higher still.
The worst performing of these six models was SVC (Hybrid vs. Not Hybrid), with a cross-validated accuracy of 0.6247; Logistic Regression (Hybrid vs. Not Hybrid) was almost equally poor at 0.6252.
This also makes sense, as Hybrid attributes overlap heavily with those of both the Sativa and Indica classes, Hybrid being a mix of both strains. That makes the decision boundary fuzzy and the class hard to isolate.
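A useful sanity check on these OvR accuracies is the trivial majority baseline: using the class counts from this dataset (687 indica, 431 sativa, 1187 hybrid; 2305 total), always predicting "Not X" already scores well, especially for Sativa. The arithmetic:

```python
# class counts from the cleaned dataset above
counts = {"indica": 687, "sativa": 431, "hybrid": 1187}
total = sum(counts.values())  # 2305

# accuracy of always predicting "Not X" in each one-vs-rest problem
for cls, n in counts.items():
    baseline = (total - n) / total
    print(f"{cls:>6} vs rest, majority baseline: {baseline:.3f}")
```

The Sativa models' ~0.82-0.83 accuracy sits only a few points above the 0.813 "always Not Sativa" baseline, so their apparent lead partly reflects class size rather than pure separability.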
Q3: Fit and report metrics for OvO versions of the models. That is, for each of the two model types, create three models:
Best C: {'C': 1}
Best CV Accuracy: 0.7521652715667162
Confusion Matrix (CV):
[[ 136 295]
[ 106 1081]]
Final Accuracy (Hybrid vs Sativa): 0.7682323856613102
Confusion Matrix (Final Model):
[[ 149 282]
[ 93 1094]]
Q4: Which of the six models did the best job differentiating the two groups? Which did the worst? Does this make intuitive sense?
Code
ovo_results = pd.DataFrame({
    "Model": [
        "SVC – Indica vs Sativa",
        "SVC – Indica vs Hybrid",
        "SVC – Hybrid vs Sativa",
        "LogReg – Indica vs Sativa",
        "LogReg – Indica vs Hybrid",
        "LogReg – Hybrid vs Sativa",
    ],
    "Best CV Accuracy": [
        svc_indica_vs_sativa_cv,
        svc_indica_vs_hybrid_cv,
        svc_hybrid_vs_sativa_cv,
        log_indica_vs_sativa_cv,
        log_indica_vs_hybrid_cv,
        log_hybrid_vs_sativa_cv,
    ],
    "Final Accuracy": [
        svc_indica_vs_sativa_final,
        svc_indica_vs_hybrid_final,
        svc_hybrid_vs_sativa_final,
        log_indica_vs_sativa_final,
        log_indica_vs_hybrid_final,
        log_hybrid_vs_sativa_final,
    ],
})
ovo_results
                       Model  Best CV Accuracy  Final Accuracy
0     SVC – Indica vs Sativa          0.852410        0.865832
1     SVC – Indica vs Hybrid          0.756145        0.757204
2     SVC – Hybrid vs Sativa          0.750292        0.770705
3  LogReg – Indica vs Sativa          0.855097        0.866726
4  LogReg – Indica vs Hybrid          0.762550        0.766275
5  LogReg – Hybrid vs Sativa          0.752165        0.768232
The best performing of these six models was Logistic Regression (Indica vs. Sativa), with a cross-validated accuracy of 0.8551; SVC (Indica vs. Sativa) was nearly as good at 0.8524.
This result makes intuitive sense because Indica and Sativa are the two most distinct strain categories. Their effects tend to be more polarized, and therefore they produce clearer and more separable patterns in the effect/flavor dummy variables.
Since OvO isolates just these two classes, the models do not have to account for the Hybrid strains, which normally blur the boundary between Indica and Sativa. With that problem removed, both the linear SVC and Logistic Regression can find strong separating boundaries.
The worst performing of these six models was SVC (Hybrid vs. Sativa), with a cross-validated accuracy of 0.7503; Logistic Regression (Hybrid vs. Sativa) was almost equally poor at 0.7522.
This also makes sense because Hybrid strains overlap heavily with both Indica and Sativa, making them difficult to separate from either category in a pairwise comparison. In the Hybrid vs Sativa case, Hybrid strains often share several Sativa-like descriptors (like “uplifted”, “creative”, “energetic”), but not consistently enough to create a clean decision boundary.
Due to this, distinguishing Hybrid from Sativa is more challenging, producing the lowest OvO performance across models.
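As with OvR, scikit-learn can automate the pairwise setup: `OneVsOneClassifier` fits one binary model per pair of classes (3 pairs for 3 classes) and predicts by vote. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

# synthetic 3-class stand-in for the strain data
X, y = make_classification(n_samples=600, n_classes=3, n_informative=5,
                           n_clusters_per_class=1, random_state=0)

# one binary SVC per pair of classes: 3 * (3 - 1) / 2 = 3 models
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
print("pairwise models fitted:", len(ovo.estimators_))
print("training accuracy:", ovo.score(X, y))
```

The manual pairwise fits above additionally expose per-pair confusion matrices, which this wrapper does not.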
Q5: Suppose you had simply input the full data, with three classes, into the LogisticRegression function. Would this have automatically taken an “OvO” approach or an “OvR” approach? What about for SVC?
If I had simply input the dataset with all three classes (multiclass data) into LogisticRegression, recent versions of scikit-learn would by default fit a single multinomial (softmax) model rather than a true "OvR" ensemble; the one-vs-rest behavior only applies with solver="liblinear" (or the now-deprecated multi_class="ovr" setting).
If I had simply input the dataset with all three classes (multiclass data) into SVC, it would automatically use the "OvO" approach, fitting one binary classifier per pair of classes.
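Both behaviors can be inspected directly on a toy problem; a sketch (not the strain data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           n_clusters_per_class=1, random_state=0)

# SVC always trains one-vs-one internally; with decision_function_shape="ovo"
# the 3 * (3 - 1) / 2 = 3 pairwise score columns are visible directly
svc = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
print("SVC pairwise decision columns:", svc.decision_function(X).shape[1])

# LogisticRegression (lbfgs solver) fits a single multinomial model:
# one coefficient row per class, with probabilities summing to 1
log = LogisticRegression(max_iter=1000).fit(X, y)
print("LogReg coef shape:", log.coef_.shape)
print("probs sum to 1:", np.allclose(log.predict_proba(X).sum(axis=1), 1))
```

So the two estimators genuinely differ in how they extend to three classes, which is why Part Three's explicit OvR/OvO constructions are instructive.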